First Portfolio Project: SAT Scores
Introduction¶
Purpose¶
The purpose of this notebook is to build a portfolio project that showcases some basic data science skills.
To do this, a tutorial from Dataquest will be followed.
This notebook reworks the Dataquest tutorial project. Then, in a separate notebook, a different project with the same format will be made.
Topic¶
The SAT scores of high-school students from New York City will be analysed, along with several demographic metrics.
The SAT is a test used in the United States to assess students' readiness for higher education. Many universities require it for admission, so students need to do well on it.
The range of SAT scores has changed several times. For the datasets used in this notebook, the maximum score is 2400. The average SAT score of a high school's students is often used to rank schools.
There have been claims of race or gender bias in the SAT, so this analysis will try to look into these claims.
Basic datasets¶
NYC Open Data publishes data about New York City across different categories. In this case, the interest is in education data.
The basic datasets to be used are:
- 2012 SAT results: School-level results for New York City on the SAT
- 2014 - 2015 DOE High School Directory: Directory of NYC High Schools
Additional datasets¶
To enhance the data from the basic datasets, several others will be added:
- 2006 - 2011 NYS Math Test Results By Grade - School Level - All Students: Results on the New York State Mathematics Tests, Grades 3 - 8
- 2010-2011 Class Size - School-level detail: Average class sizes for each school, by grade and program type
- 2010 AP (College Board) School Level Results: School level College Board AP results for 2010. Some universities in the US grant credits based on AP test scores
- 2005-2010 Graduation Outcomes - School Level: Graduation Outcomes - Cohorts of 2001 through 2006 (Classes of 2005 through 2010)
- 2006 - 2012 School Demographics and Accountability Snapshot: Annual school accounts of NYC public school student populations served by grade, special programs, ethnicity, gender and Title I funded programs
- 2011 NYC School Survey: Every year, all parents, all teachers, and students in grades 6 - 12 take the NYC School Survey
- School Districts: GIS data: Boundaries of School Districts: Layout of school districts
Background information¶
Before beginning the analysis, it's useful to have some background information to provide context for the data:
- New York City is divided into five boroughs
- NYC is divided into 32 geographic school districts, each containing a few dozen schools
- The datasets have schools that are not high schools, so data cleaning will be necessary
- Every public school in New York City is assigned a unique DBN ("District Borough Number")
- By computing district-level metrics, the GIS (Geographic Information System) data can be used to plot maps and show geographical differences
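As a quick illustration of the DBN structure, a hypothetical DBN can be split into its parts with plain string slicing (the value here is made up for illustration, not taken from the datasets):

```python
# Hypothetical DBN: 2-digit district, 1-letter borough, 3-digit school code
dbn = '01M015'
district = dbn[:2]      # Community School District, e.g. '01'
borough = dbn[2]        # borough letter, e.g. 'M' for Manhattan
school_code = dbn[3:]   # school number, unique within the borough
print(district, borough, school_code)
```

This slicing will be useful later, when district-level metrics are computed from the combined data.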
Data ingestion¶
The first part of the process is ingesting the data, or reading it in from the different sources (tabular files in this case).
To do this, the following code will:
- Loop through the files
- Read each file into a Pandas DataFrame
- Put each DataFrame into a Python dictionary
# Import pandas, numpy, and IPython's Markdown, pyplot and seaborn
import pandas as pd
import numpy as np
from IPython.display import Markdown
import matplotlib.pyplot as plt
import seaborn as sns
# Make a list of the files to be imported
files={
# AP (College Board) Results
'ap_2010':'https://data.cityofnewyork.us/api/views/itfs-ms3e/rows.csv',
# Class Size
'class_size':'https://data.cityofnewyork.us/api/views/urz7-pzb3/rows.csv',
# School Demographics and Accountability
'demographics':'https://data.cityofnewyork.us/api/views/ihfw-zy9j/rows.csv',
# Graduation Outcomes
'graduation':'https://data.cityofnewyork.us/api/views/vh2h-md7a/rows.csv',
# High School Directory
'hs_directory':'https://data.cityofnewyork.us/api/views/n3p6-zve2/rows.csv',
# NYS Math Test Results By Grade
'math_test_results':'https://data.cityofnewyork.us/api/views/jufi-gzgp/rows.csv',
# SAT results
'sat_results':'https://data.cityofnewyork.us/api/views/f9bf-2cp4/rows.csv',
}
# Create an empty dictionary to hold each DataFrame
data={}
# Loop through the files
for k,v in files.items():
# Read each file into a DataFrame
d=pd.read_csv(v)
data[k]=d
Once the data has been read, the .head() method can be used on the data dictionary to print the first five rows of each DataFrame.
# Loop through the keys and values of the dictionary
for k,v in data.items():
# Print the name of each DataFrame, followed by its first five rows
display(Markdown('## {}'.format(k)))
display(v.head())
Some useful details can be noted from the previous output:
- Most of the datasets contain a DBN column, which can be used to uniquely identify each school
- There is information that could be used to make a map (the Location 1 column)
- In some datasets there are multiple rows for each school, so preprocessing will be necessary
Combining the datasets¶
To make working with the data easier, all the individual datasets will be combined into a single one. To do this, a column that exists in every dataset must be found. From what was noted above, DBN might be that column.
Two of the datasets, hs_directory and class_size, don't contain a DBN column.
hs_directory dataset¶
In this case, the column exists, but it is called dbn, so it is only necessary to change its name to DBN.
# Change dbn column to DBN
data['hs_directory']['DBN']=data['hs_directory']['dbn']
class_size dataset¶
In the rest of the datasets, the DBN column looks like the following output.
display(data['demographics']['DBN'].head())
But, looking at the first five class_size rows, no DBN column can be seen.
display(data['class_size'].head())
From the previous output, it can be observed that the DBN is formed with the CSD, BOROUGH, and SCHOOL CODE columns combined.
Looking at the dictionary of the dataset, the following description can be obtained:
DBN stands for District, Borough, School Number, and every school in the system can be identified by this 6-digit code:
- The first two numbers represent the school district (CSD, Community School District)
- The third character signifies the borough (BOROUGH) in which the school is located (M = Manhattan, X = Bronx, R = Staten Island, K = Brooklyn and Q = Queens)
- The final three digits represent the SCHOOL CODE and are unique within the borough
So the DBN column can be constructed for the class_size dataset.
# Combine the CSD and SCHOOL CODE columns to form DBN
# (SCHOOL CODE values already start with the borough letter, e.g. M015)
# 02d formats an integer (d) to a field of minimum width 2 (2), with zero-padding on the left (0)
data['class_size']['DBN']=data['class_size'].apply(lambda x:'{0:02d}{1}'.format(x['CSD'],x['SCHOOL CODE']),axis=1)
# Print the first five rows of class_size
display(data['class_size'].head())
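The `02d` format specification used above can be checked in isolation on toy values:

```python
# '{0:02d}' pads an integer to a minimum width of 2 with leading zeros,
# so single-digit district numbers line up with two-digit ones
padded_small = '{0:02d}'.format(1)
padded_large = '{0:02d}'.format(15)
print(padded_small, padded_large)  # 01 15
```

Without the zero-padding, district 1 would produce the 5-character string `1M015` instead of a valid 6-character DBN.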
School surveys¶
The NYC School Survey is taken every year by parents, teachers and students. It is meant to provide insight into the learning environment of schools, giving access to metrics that go beyond simple test scores.
The questions ask the school community for its opinions on academic expectations, communication, engagement, and safety and respect.
In 2011, 960,191 surveys were submitted.
The following code will:
- Read the surveys for all schools and the surveys for district 75
- Add a flag that indicates whether each school belongs to district 75
- Combine the datasets
from zipfile import ZipFile
import urllib.request
import shutil
# Download the file from "url" and save it to "file_name"
url = 'https://data.cityofnewyork.us/api/views/mnz3-dyi8/files/220be57a-7c05-48dc-94de-b11a882ca9da?download=true&filename=2011%20School%20Survey.zip'
file_name='survey_file.zip'
with urllib.request.urlopen(url) as response, open(file_name, 'wb') as out_file:
shutil.copyfileobj(response, out_file)
# Extract all files in the zip
ZipFile(file_name).extractall()
# Read the surveys for all schools and the surveys for district 75
# Use tab as delimiter and windows-1252 as file encoding
survey_all=pd.read_csv('2011 data files online/masterfile11_gened_final.txt',delimiter='\t',encoding='windows-1252')
survey_d75=pd.read_csv('2011 data files online/masterfile11_d75_final.txt',delimiter='\t',encoding='windows-1252')
# Add a flag, True if the school district is 75, False otherwise
survey_all['d75']=False
survey_d75['d75']=True
# Concatenate the datasets into a single DataFrame
survey=pd.concat([survey_all,survey_d75],axis=0)
# Print the first five rows of survey
display(survey.head())
Looking at the columns¶
The survey DataFrame has 2773 columns. To be able to easily compare columns and calculate correlations, this number should be reduced.
The survey data came with a data dictionary (in spreadsheet format) that explains the meaning of each column.
display(pd.read_excel('2011 data files online/Survey Data Dictionary.xls'))
Not all columns will provide useful information for the analysis, so only the important ones will be kept:
- DBN: District, Borough, School Number
- rr_s: Student Response Rate
- rr_t: Teacher Response Rate
- rr_p: Parent Response Rate
- N_s: Number of student respondents
- N_t: Number of teacher respondents
- N_p: Number of parent respondents
- saf_p_11: Safety and Respect score based on parent responses
- com_p_11: Communication score based on parent responses
- eng_p_11: Engagement score based on parent responses
- aca_p_11: Academic expectations score based on parent responses
- saf_t_11: Safety and Respect score based on teacher responses
- com_t_11: Communication score based on teacher responses
- eng_t_11: Engagement score based on teacher responses
- aca_t_11: Academic expectations score based on teacher responses
- saf_s_11: Safety and Respect score based on student responses
- com_s_11: Communication score based on student responses
- eng_s_11: Engagement score based on student responses
- aca_s_11: Academic expectations score based on student responses
- saf_tot_11: Safety and Respect total score
- com_tot_11: Communication total score
- eng_tot_11: Engagement total score
- aca_tot_11: Academic Expectations total score
# Print the shape of the survey DataFrame
display(Markdown('The `survey` DataFrame has a shape of {} before removing unwanted columns'.format(survey.shape)))
# Change dbn column to DBN
survey['DBN']=survey['dbn']
# Make a list of the columns to be kept
survey_fields=[
'DBN',
'rr_s',
'rr_t',
'rr_p',
'N_s',
'N_t',
'N_p',
'saf_p_11',
'com_p_11',
'eng_p_11',
'aca_p_11',
'saf_t_11',
'com_t_11',
'eng_t_11',
'aca_t_11',
'saf_s_11',
'com_s_11',
'eng_s_11',
'aca_s_11',
'saf_tot_11',
'com_tot_11',
'eng_tot_11',
'aca_tot_11',
]
# Keep only the columns from the survey_fields list
survey = survey.loc[:,survey_fields]
# Add the survey DataFrame to the data dictionary
data['survey'] = survey
# Print the first five rows of survey and its shape
display(survey.head())
display(Markdown('The `survey` DataFrame has a shape of {} after removing unwanted columns'.format(survey.shape)))
Understanding what each dataset contains, and which columns in each dataset are the relevant ones is crucial to avoid wasting time and effort later on, during the analysis phase.
Condensing datasets¶
Looking again at the first rows of each dataset, it can be seen that in many cases there are several rows for each high school. But in the sat_results dataset, there is only one row per high school.
# Loop through the keys and values of the dictionary
for k,v in data.items():
# Print the name of each DataFrame, followed by its first five rows
display(Markdown('## {}'.format(k)))
display(v.head())
To be able to compare SAT scores with different variables, there should be only one row per high school.
Condensing class_size¶
In this dataset, the GRADE and PROGRAM TYPE columns have multiple values for each high school. By restricting each field to a single value, most of the duplicate rows can be filtered out.
The following code will:
- Only select values from class_size where the GRADE column equals 09-12 (high school grades 9 through 12)
- Only select values from class_size where the PROGRAM TYPE column equals GEN ED (restrict values to general education programs only)
- Group class_size by DBN, then find the average class_size values for each school
# Create a class_size DataFrame for ease of manipulation
class_size=data['class_size']
# Only select values from class_size where the GRADE column equals 09-12
class_size=class_size[class_size['GRADE ']=='09-12']
# Only select values from class_size where the PROGRAM TYPE column equals GEN ED
class_size=class_size[class_size['PROGRAM TYPE']=='GEN ED']
# Group class_size by DBN, then take the column means (find the average class_size values for each school)
class_size=class_size.groupby('DBN').mean(numeric_only=True)
# Reset the index so DBN is a column again
class_size.reset_index(inplace=True)
# Replace the class_size DataFrame in the data dictionary by the class_size DataFrame just created
data['class_size']=class_size
Condensing demographics¶
The dataset contains data for several years, specified by the schoolyear column. Only the rows for the most recent year (schoolyear == 20112012, i.e. the 2011-2012 school year) will be kept.
demographics=data['demographics']
demographics=demographics[demographics['schoolyear']==20112012]
data['demographics']=demographics
Condensing math_test_results¶
This dataset is segmented by Grade and Year. Only the data for the most recent year and highest grade will be kept.
data['math_test_results']=data['math_test_results'][data['math_test_results']['Year']==2011]
data['math_test_results']=data['math_test_results'][data['math_test_results']['Grade']=='8']
Condensing graduation¶
In this case, the data used will be that of the class of 2010 (Cohort=2006), with metrics calculated across the entire cohort (Demographic=Total Cohort).
data['graduation']=data['graduation'][data['graduation']['Cohort']=='2006']
data['graduation']=data['graduation'][data['graduation']['Demographic']=='Total Cohort']
Resulting data¶
Data cleaning is an important step to avoid arriving at nonsense conclusions (to avoid the outcome of the old adage: garbage in, garbage out).
The resulting data is presented by the following code.
# Loop through the keys and values of the dictionary
for k,v in data.items():
# Print the name of each DataFrame, followed by its first five rows
display(Markdown('## {}'.format(k)))
display(v.head())
# List of SAT score columns from the sat_results DataFrame
cols=['SAT Critical Reading Avg. Score','SAT Math Avg. Score','SAT Writing Avg. Score']
# Convert each column to numeric and add the columns to get the total SAT score sat_score
data['sat_results'][cols]=data['sat_results'][cols].apply(pd.to_numeric,errors='coerce')
data['sat_results']['sat_score']=data['sat_results'][cols].sum(axis=1)
display(data['sat_results'])
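The behaviour of `pd.to_numeric` with `errors='coerce'` and of the row-wise sum can be seen on toy data (the values and the 's' placeholder below are made up for illustration, not taken from the real dataset):

```python
import pandas as pd

toy = pd.DataFrame({
    'reading': ['400', '500', 's'],  # 's' stands in for a non-numeric value
    'math':    ['410', '510', 's'],
})
# Non-numeric values become NaN instead of raising an error
toy = toy.apply(pd.to_numeric, errors='coerce')
# sum skips NaN by default, so an all-NaN row sums to 0
toy['total'] = toy.sum(axis=1)
print(toy['total'].tolist())  # [810.0, 1010.0, 0.0]
```

Note that rows where every section score was non-numeric end up with a total of 0, which is why rows with a sat_score of zero are dropped later in the notebook.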
Geographic coordinates¶
The GIS data for each school can be used to display them on a map. This information is contained in the Location 1 column from the hs_directory dataset.
The typical value of the Location 1 column is formed by three lines:
284 Baltic Street
Brooklyn, NY 11201
(40.685451806, -73.993491465)
The third row contains the latitude and longitude coordinates for the school. These values will be obtained using string splitting.
# Split each row of Location 1 on line feeds, get the last element (coordinates), replace parentheses by empty strings, split the result on commas, get the first (lat) and second (lon) elements
data['hs_directory']['lat'] = data['hs_directory']['Location 1'].apply(lambda x: x.split('\n')[-1].replace('(', '').replace(')', '').split(', ')[0])
data['hs_directory']['lon'] = data['hs_directory']['Location 1'].apply(lambda x: x.split('\n')[-1].replace('(', '').replace(')', '').split(', ')[1])
# Convert new columns to numeric
data['hs_directory'][['lat','lon']]=data['hs_directory'][['lat','lon']].apply(pd.to_numeric,errors='coerce')
The result of computing the new variables is shown below.
for k,v in data.items():
display(Markdown('## {}'.format(k)))
display(v.head())
Combining the datasets¶
In this stage the different datasets will be combined, using the DBN column.
Not all datasets have the same number of rows (some datasets contain high schools that aren't present in the rest). When joining, it's important not to lose any data.
Given that the interest is in SAT scores, any high school not present in sat_results will be ignored. To do this, all the joins will be left joins, with sat_results as the left table.
The below code will:
- Loop through each dataset
- Perform a left join
full=data['sat_results']
for k,v in data.items():
if k!='sat_results':
full=full.merge(v,on='DBN',how='left')
full.shape
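The left-join semantics described above can be checked on toy data: every row of the left table survives, and unmatched rows get NaN in the right table's columns (the DBNs and values below are made up for illustration):

```python
import pandas as pd

left = pd.DataFrame({'DBN': ['01M015', '01M019'], 'sat_score': [1200, 1400]})
right = pd.DataFrame({'DBN': ['01M015'], 'ap_takers': [30]})

# indicator=True adds a _merge column that records where each row came from
merged = left.merge(right, on='DBN', how='left', indicator=True)
print(merged[['DBN', '_merge']])
```

The `indicator` parameter is a convenient way to audit a join: rows marked `left_only` had no match in the right table.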
Deal with duplicates¶
It's important to deal with duplicate values so that they don't distort the metrics to be calculated.
In this case, there is only one duplicate row (possibly a typo), so removing the second instance of it (as it contains NA values) solves the problem.
# Check for duplicates
display(full[full['DBN'].duplicated(keep=False)])
# Drop duplicates, keep first occurrence
full.drop_duplicates(subset='DBN',keep='first',inplace=True)
display(full.shape)
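The `keep=False` argument used above can be illustrated on toy data: it marks every member of a duplicate group, not just the repeats, which is what makes it useful for inspecting duplicates before dropping them (values below are made up):

```python
import pandas as pd

df = pd.DataFrame({'DBN': ['01M015', '01M015', '01M019'], 'x': [1.0, None, 2.0]})
# keep=False flags both copies of the duplicated DBN
dupes = df[df['DBN'].duplicated(keep=False)]
print(dupes['DBN'].tolist())  # ['01M015', '01M015']
# keep='first' retains the first occurrence and drops the rest
deduped = df.drop_duplicates(subset='DBN', keep='first')
print(deduped['DBN'].tolist())  # ['01M015', '01M019']
```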
AP exam results¶
To be able to perform computations with the AP exam results, the data must be first converted to numeric form. Also, any missing values should be filled in (in this case, with 0, indicating that the AP exam wasn't taken).
AP_cols=['AP Test Takers ','Total Exams Taken','Number of Exams with scores 3 4 or 5']
full[AP_cols]=full[AP_cols].apply(pd.to_numeric,errors='coerce')
full[AP_cols]=full[AP_cols].fillna(value=0)
Create a school district column¶
As explained before, the first two characters of the DBN represent the school's district. This number can be used to calculate district-level metrics and plot them on a map.
# Create a school_dist column that contains the first two characters from the DBN column
full['school_dist']=full['DBN'].apply(lambda x:x[:2])
Imputation¶
To be able to perform many computations, the data must not have any missing values.
To accomplish this, imputation will be used. Imputation is the process of replacing any missing value with another value. In this case, the mean of the column will be used.
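As a minimal sketch of mean imputation on toy data (the values are made up for illustration):

```python
import pandas as pd

toy_scores = pd.DataFrame({'score': [1200.0, None, 1400.0]})
# The column mean (1300.0 here) replaces the missing value
toy_scores = toy_scores.fillna(toy_scores.mean(numeric_only=True))
print(toy_scores['score'].tolist())  # [1200.0, 1300.0, 1400.0]
```

Mean imputation preserves the column mean but shrinks its variance, so it is a pragmatic choice for a first analysis rather than a statistically neutral one.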
# Show quantity of missing values per column
with pd.option_context('display.max_columns', 500):
display(pd.DataFrame(full.isna().sum()).transpose())
# Drop columns with ONLY NA values
full.dropna(how='all',inplace=True,axis=1)
display(full.shape)
# Fill NA values with the mean of each numeric column
full.fillna(full.mean(numeric_only=True),inplace=True)
display(pd.DataFrame(full.isna().sum()).transpose())
In the above output, it can be observed that there are many columns which still have missing values. This is because imputation was performed only on numeric columns.
If instead only the numeric columns are shown, it can be seen that none of them has missing values.
# Show quantity of missing values per numeric column
with pd.option_context('display.max_columns', 500):
display(pd.DataFrame(full.select_dtypes(include=['number']).isna().sum()).transpose())
There are many missing values in the non-numeric columns. In this simple analysis, they will be left as is, but the issue should be dealt with if they are going to be used (for example, by replacing the missing values with the most frequent value, if that makes sense).
# Show quantity of missing values per non-numeric column
with pd.option_context('display.max_columns', 500):
display(pd.DataFrame(full.select_dtypes(exclude=['number']).isna().sum()).transpose())
Finally, as the interest is in SAT scores, any row with a score of zero will be dropped.
full.drop(full[full['sat_score'] ==0].index, inplace=True)
display(full.shape)
Show the resulting dataset¶
The first twenty rows of the dataset are shown next.
with pd.option_context('display.max_columns', 500,'display.max_rows', 500):
display(full.head(20))
display(full.shape)
Computing correlations¶
As a basic metric of the relationships between columns, their correlations can be computed. Correlation is the statistical relationship between two variables, and it is measured by a correlation coefficient.
In this case, the Pearson correlation coefficient (or Pearson's r) will be used. This coefficient can take values between -1 and 1. A value of 0 means no correlation, the closer to 1 the stronger the positive correlation, and the closer to -1 the stronger the negative correlation.
The .corr() method, by default, calculates the Pearson correlation coefficient for all numeric columns in the dataset, ignoring missing values.
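The extreme values of Pearson's r can be checked on a toy DataFrame (made-up values, chosen so the relationships are exactly linear):

```python
import pandas as pd

demo = pd.DataFrame({
    'x': [1, 2, 3, 4],
    'y': [2, 4, 6, 8],   # perfectly linear in x -> r = 1
    'z': [8, 6, 4, 2],   # perfectly anti-linear in x -> r = -1
})
corrs = demo.corr()['x']
print(corrs['y'], corrs['z'])  # 1.0 -1.0
```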
The code below shows the Pearson correlation coefficient for each column against the sat_score column.
with pd.option_context('display.max_columns', 500,'display.max_rows', 500):
display(full.corr()['sat_score'])
The output gives several insights worth exploring:
- Surprisingly, total_enrollment correlates positively with SAT scores. One might expect smaller schools, able to focus more on each student, to produce higher scores.
- The English language learners percentage (ell_percent) is negatively correlated with SAT scores.
- The correlations between SAT scores and survey responses are all rather small.
- There is significant racial inequality in SAT scores (asian_per, black_per, hispanic_per and white_per columns).
- The percentage of male students (male_per) correlates negatively with SAT scores, and the percentage of female students (female_per) correlates positively.
Putting the schools on the map¶
A good way to set the context for the problem that is being analysed is with charts or maps.
In this case, the location of each school will be plotted on a NYC map.
The maps will be constructed using Folium, a library that takes advantage of the data wrangling versatility of Python and the mapping strengths of Leaflet.
The code below will:
- Create a base map for NYC.
- Make a marker cluster, and add it to the base map.
- Create a marker for each school, and add them to the marker cluster.
- Save the map as an .html file.
- Show the map.
import folium
from folium import plugins
# Create a base map, centered in the mean of the coordinates for all schools
schools_map = folium.Map(location=[full['lat'].mean(), full['lon'].mean()], zoom_start=10)
# Create a marker cluster, and add it to the base map
marker_cluster = folium.plugins.MarkerCluster().add_to(schools_map)
for name, row in full.iterrows():
# Create a marker for each school, and add it to the marker cluster
folium.Marker([row['lat'], row['lon']], popup='{0}: {1}'.format(row['DBN'], row['school_name'])).add_to(marker_cluster)
schools_map.save('schools.html')
schools_map
Making a heat map¶
The previous map is useful, but the school concentration is hard to notice. This can be alleviated with the use of a heat map.
# Create a base map, centered in the mean of the coordinates for all schools
schools_heatmap = folium.Map(location=[full['lat'].mean(), full['lon'].mean()], zoom_start=10)
# Create a heatmap layer, and add it as a child to the base map
schools_heatmap.add_child(plugins.HeatMap([[row['lat'], row['lon']] for name, row in full.iterrows()]))
schools_heatmap.save('heatmap.html')
schools_heatmap
District-level mapping¶
Heat maps allow for the visualisation of gradients but aren't very useful to show differences across geographical boundaries (like districts).
So it will be necessary to compute metrics by district, in the following way:
- Group full by district.
- Compute the mean of each column by district.
- Reset the index and use the default one.
- Remove leading zeros.
district_data = full.groupby('school_dist').mean(numeric_only=True).reset_index()
# Remove leading zeros
district_data['school_dist'] = district_data['school_dist'].apply(lambda x: str(int(x)))
In the code below there is a function to plot district-level metrics. In this case, it will be used to plot the average SAT score per district.
To be able to do this, the district.geojson dataset will be used. This dataset contains GeoJSON data of the boundaries of each district.
def show_district_map(col):
geo_path = 'https://data.cityofnewyork.us/api/geospatial/r8nu-ymqj?method=export&format=GeoJSON'
# Create a base map
districts = folium.Map(location=[full['lat'].mean(), full['lon'].mean()], zoom_start=10)
# Create a choropleth map, and add it to the base map
folium.Choropleth(
geo_data=geo_path,
name=col,
data=district_data,
columns=['school_dist', col],
key_on='feature.properties.school_dist',
fill_color='OrRd',
fill_opacity=0.7,
line_opacity=0.2,
).add_to(districts)
# Add a control to select which layers to show on the map
folium.LayerControl().add_to(districts)
districts.save('districts.html')
return districts
show_district_map('sat_score')
Visualising correlations¶
In the following plots, the relationships between variables that were noticed when computing correlations will be visualised, to try to understand them better.
SAT score vs enrolment¶
The relationship between SAT score and enrolment can be investigated using a scatter plot.
sns.set_theme()
_=sns.scatterplot(data=full,x='total_enrollment',y='sat_score')
There is a big cluster on the bottom left, with low total enrolment and low SAT scores.
Also, there seems to be a small positive correlation between SAT scores and total enrolment.
By making a list of schools, ordered by ascending SAT scores, this can be further explored.
display(full.sort_values(by='sat_score')['SCHOOL NAME'].head(10))
After a quick web search, it is found that most of these schools are for students who are learning English. This seems to indicate that it is not total enrolment that is correlated with SAT scores, but rather the fact that students are English language learners.
SAT score vs English language learners percentage¶
To elucidate whether the previous conjecture is true, a scatter plot of SAT scores vs English language learners percentage (ell_percent) is made.
_=sns.scatterplot(data=full,x='ell_percent',y='sat_score')
There is a cluster with high ell_percent and low SAT scores on the bottom right of the plot.
By plotting the English language learners percentage by district, it can be compared to the previous map of SAT scores by district.
show_district_map('ell_percent')
As seen from both maps, districts with a high percentage of ELL tend to have lower SAT scores.
SAT scores vs survey scores¶
Intuitively, a strong correlation can be expected between survey scores and SAT scores. To check if this is true, the correlation between these variables can be plotted.
_=sns.barplot(
x='index',
y='sat_score',
data=full.corr()['sat_score'][['rr_s', 'rr_t', 'rr_p', 'N_s', 'N_t', 'N_p', 'saf_tot_11', 'com_tot_11', 'aca_tot_11', 'eng_tot_11']].reset_index()
)
_=plt.xticks(rotation=90)
Unexpectedly, the variables with the highest correlation are N_s, N_t, and N_p (the number of student, teacher, and parent respondents, respectively). All three correlate strongly with total enrolment, so they are likely just proxies for it rather than independent signals.
The next strongest correlation is with rr_s, the student response rate. This makes sense, as a more engaged student body is more likely both to answer the survey and to do better on tests.
After that comes saf_tot_11, the Safety and Respect total score. This also makes sense, as a safer environment makes learning easier for students.
But none of the other metrics correlates substantially with SAT scores. This might indicate that there is some kind of problem with the questions being asked.
Race and SAT scores¶
Another angle to explore involves the relation of race to SAT scores. There was a significant difference between SAT scores across different races, which can be clearly seen by plotting them.
_=sns.barplot(
x='index',
y='sat_score',
data=full.corr()['sat_score'][['white_per', 'asian_per', 'black_per', 'hispanic_per']].reset_index()
)
_=plt.xticks(rotation=90)
Apparently, higher percentages of White and Asian students correlate with higher SAT scores, and higher percentages of Black and Hispanic students correlate with lower SAT scores.
It can be hypothesised that, for Hispanic students, this is because many are recent immigrants and therefore English language learners.
A Hispanic percentage by district map can help shed some light on this.
show_district_map('hispanic_per')
There seems to be a correlation between the Hispanic student percentage and the English language learner percentage, but the topic needs to be analysed further.
Gender vs SAT scores¶
Finally, the correlation between gender and SAT scores will be explored. It was noticed that a higher percentage of female students correlates positively with SAT scores. This can be visualised with a bar plot.
_=sns.barplot(
x='index',
y='sat_score',
data=full.corr()['sat_score'][['male_per', 'female_per']].reset_index()
)
_=plt.xticks(rotation=90)
To further analyse this correlation, a scatter plot can be made.
_=sns.scatterplot(data=full,x='female_per', y='sat_score')
At the extreme left and right of the plot are the single-gender schools (all-male and all-female, respectively).
Also, there is a cluster with a high percentage of females (greater than 65 %) and very high SAT scores. To try to understand the nature of this cluster, a list of these schools can be printed.
display(full[(full['female_per'] > 65) & (full['sat_score'] > 1400)].sort_values(by='SCHOOL NAME')['SCHOOL NAME'])
A web search reveals that these are exclusive schools, centred on the performing arts. These schools tend to have a higher percentage of females, and higher SAT scores.
AP scores¶
Now that the demographic angle has been analysed, a final relationship will be explored: that between the proportion of students taking AP exams and SAT scores. A high correlation between these metrics can be expected, as students who take advanced courses tend to perform better on tests.
# Compute proportion of AP test-takers
full['ap_avg'] = full['AP Test Takers '] / full['total_enrollment']
_=sns.scatterplot(data=full,x='ap_avg', y='sat_score')
The plot shows that there seems to be a strong positive correlation between the two variables.
In the top right of the plot, there is a cluster with very high SAT scores and AP exam takers proportion. A list of these schools can be printed to examine them.
display(full[(full['ap_avg'] > .3) & (full['sat_score'] > 1700)].sort_values(by='SCHOOL NAME')['SCHOOL NAME'])
These seem to be highly selective schools, which require a test to get in, so it makes sense that they have a high proportion of AP test-takers.
Bringing the story to an end¶
In this notebook, a basic data science analysis was performed, bringing together different datasets and going through the whole workflow.
Several computations and visualisations were made to try to better understand the data. As a highlight, maps were used to help set up the context of the analysis.
This was a first approach to data science, done following a tutorial, and it should serve as a basis to perform a more custom-made project.